{
  "nbformat": 4,
  "nbformat_minor": 2,
  "metadata": {},
  "cells": [
    {
      "metadata": {},
      "source": [
        "<td>\n",
        "   <a target=\"_blank\" href=\"https://labelbox.com\" ><img src=\"https://labelbox.com/blog/content/images/2021/02/logo-v4.svg\" width=256/></a>\n",
        "</td>\n",
        "\n"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "\n",
        "<td>\n",
        "\n",
        "<a href=\"https://colab.research.google.com/github/Labelbox/labelbox-python/blob/master/examples/annotation_import/conversational_LLM_data_generation.ipynb\" target=\"_blank\"><img\n",
        "src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"></a>\n",
        "</td>\n",
        "\n",
        "<td>\n",
        "\n",
        "<a href=\"https://github.com/Labelbox/labelbox-python/tree/master/examples/annotation_import/conversational_LLM_data_generation.ipynb\" target=\"_blank\"><img\n",
        "src=\"https://img.shields.io/badge/GitHub-100000?logo=github&logoColor=white\" alt=\"GitHub\"></a>\n",
        "</td>"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "# LLM Data Generation with MAL and Ground Truth\n",
        "This demo is meant to showcase how to generate prompts and responses to fine-tune large language models (LLMs) using MAL and Ground truth"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "!pip install -q \"labelbox[data]\""
      ],
      "cell_type": "code",
      "outputs": [],
      "execution_count": null
    },
    {
      "metadata": {},
      "source": [
        "## Set up "
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "import labelbox as lb\n",
        "import uuid"
      ],
      "cell_type": "code",
      "outputs": [],
      "execution_count": null
    },
    {
      "metadata": {},
      "source": [
        "## Replace with your API key"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "API_KEY = \"\"\n",
        "client = lb.Client(api_key=API_KEY)"
      ],
      "cell_type": "code",
      "outputs": [],
      "execution_count": null
    },
    {
      "metadata": {},
      "source": [
        "## Supported annotations for LLM data generation\n",
        "Currently, we only support NDJson format for prompt and responses"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "## Prompt:"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "### Classification: Free-form text"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "prompt_annotation_ndjson = {\n",
        "  \"name\": \"Follow the prompt and select answers\",\n",
        "  \"answer\": \"This is an example of a prompt\"\n",
        "}"
      ],
      "cell_type": "code",
      "outputs": [],
      "execution_count": null
    },
    {
      "metadata": {},
      "source": [
        "# Responses:"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "### Classification: Radio (single-choice)"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "response_radio_annotation_ndjson= {\n",
        "  \"name\": \"response_radio\",\n",
        "  \"answer\": {\n",
        "      \"name\": \"response_a\"\n",
        "    }\n",
        "}"
      ],
      "cell_type": "code",
      "outputs": [],
      "execution_count": null
    },
    {
      "metadata": {},
      "source": [
        "### Classification: Free-form text"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "# Only NDJson is currently supported\n",
        "response_text_annotation_ndjson = {\n",
        "  \"name\": \"Provide a reason for your choice\",\n",
        "  \"answer\": \"This is an example of a response text\"\n",
        "}\n"
      ],
      "cell_type": "code",
      "outputs": [],
      "execution_count": null
    },
    {
      "metadata": {},
      "source": [
        "### Classification: Checklist (multi-choice)"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "response_checklist_annotation_ndjson = {\n",
        "  \"name\": \"response_checklist\",\n",
        "  \"answer\": [\n",
        "    {\n",
        "      \"name\": \"response_a\"\n",
        "    },\n",
        "    {\n",
        "      \"name\": \"response_c\"\n",
        "    }\n",
        "  ]\n",
        "}"
      ],
      "cell_type": "code",
      "outputs": [],
      "execution_count": null
    },
    {
      "metadata": {},
      "source": [
        "## Step 1: Create a project and data rows in Labelbox UI\n",
        "\n",
        "Currently we do not support this workflow through the SDK.\n",
        "#### Workflow:\n",
        "\n",
        "1. Navigate to annotate and select ***New project***\n",
        "\n",
        "2. Select ***LLM data generation*** and then select ***Humans generate prompts and responses***\n",
        "\n",
        "3. Name your project, select ***create a new dataset*** and name your dataset. (data rows will be generated automatically in \n",
        "this step)\n",
        "\n",
        "\n"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "# Enter the project id\n",
        "project_id = \"\"\n",
        "\n",
        "# Select one of the global keys from the data rows generated\n",
        "global_key = \"\""
      ],
      "cell_type": "code",
      "outputs": [],
      "execution_count": null
    },
    {
      "metadata": {},
      "source": [
        "## Step 2 : Create/select an Ontology in Labelbox UI"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "Currently we do not support this workflow through the SDK\n",
        "#### Workflow: \n",
        "1. In your project, navigate to ***Settings*** and ***Label editor***\n",
        "\n",
        "2. Click on ***Edit***\n",
        "\n",
        "3. Create a new ontology and add the features used in this demo\n",
        "\n"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "#### For this demo the following ontology was generated in the UI: "
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "ontology_json = \"\"\"\n",
        "{\n",
        " \"tools\": [],\n",
        " \"relationships\": [],\n",
        " \"classifications\": [\n",
        "  {\n",
        "   \"schemaNodeId\": \"clpvq9d0002yt07zy0khq42rp\",\n",
        "   \"featureSchemaId\": \"clpvq9d0002ys07zyf2eo9p14\",\n",
        "   \"type\": \"prompt\",\n",
        "   \"name\": \"Follow the prompt and select answers\",\n",
        "   \"archived\": false,\n",
        "   \"required\": true,\n",
        "   \"options\": [],\n",
        "   \"instructions\": \"Follow the prompt and select answers\",\n",
        "   \"minCharacters\": 5,\n",
        "   \"maxCharacters\": 100\n",
        "  },\n",
        "  {\n",
        "   \"schemaNodeId\": \"clpvq9d0002yz07zy0fjg28z7\",\n",
        "   \"featureSchemaId\": \"clpvq9d0002yu07zy28ik5w3i\",\n",
        "   \"type\": \"response-radio\",\n",
        "   \"name\": \"response_radio\",\n",
        "   \"instructions\": \"response_radio\",\n",
        "   \"scope\": \"global\",\n",
        "   \"required\": true,\n",
        "   \"archived\": false,\n",
        "   \"options\": [\n",
        "    {\n",
        "     \"schemaNodeId\": \"clpvq9d0002yw07zyci2q5adq\",\n",
        "     \"featureSchemaId\": \"clpvq9d0002yv07zyevmz1yoj\",\n",
        "     \"value\": \"response_a\",\n",
        "     \"label\": \"response_a\",\n",
        "     \"position\": 0,\n",
        "     \"options\": []\n",
        "    },\n",
        "    {\n",
        "     \"schemaNodeId\": \"clpvq9d0002yy07zy8pe48zdj\",\n",
        "     \"featureSchemaId\": \"clpvq9d0002yx07zy0jvmdxk8\",\n",
        "     \"value\": \"response_b\",\n",
        "     \"label\": \"response_b\",\n",
        "     \"position\": 1,\n",
        "     \"options\": []\n",
        "    }\n",
        "   ]\n",
        "  },\n",
        "  {\n",
        "   \"schemaNodeId\": \"clpvq9d0002z107zygf8l62ys\",\n",
        "   \"featureSchemaId\": \"clpvq9d0002z007zyg26115f9\",\n",
        "   \"type\": \"response-text\",\n",
        "   \"name\": \"provide_a_reason_for_your_choice\",\n",
        "   \"instructions\": \"Provide a reason for your choice\",\n",
        "   \"scope\": \"global\",\n",
        "   \"required\": true,\n",
        "   \"archived\": false,\n",
        "   \"options\": [],\n",
        "   \"minCharacters\": 5,\n",
        "   \"maxCharacters\": 100\n",
        "  },\n",
        "  {\n",
        "   \"schemaNodeId\": \"clpvq9d0102z907zy8b10hjcj\",\n",
        "   \"featureSchemaId\": \"clpvq9d0002z207zy6xla7f82\",\n",
        "   \"type\": \"response-checklist\",\n",
        "   \"name\": \"response_checklist\",\n",
        "   \"instructions\": \"response_checklist\",\n",
        "   \"scope\": \"global\",\n",
        "   \"required\": true,\n",
        "   \"archived\": false,\n",
        "   \"options\": [\n",
        "    {\n",
        "     \"schemaNodeId\": \"clpvq9d0102z407zy0adq0rfr\",\n",
        "     \"featureSchemaId\": \"clpvq9d0002z307zy6dqb8xsw\",\n",
        "     \"value\": \"response_a\",\n",
        "     \"label\": \"response_a\",\n",
        "     \"position\": 0,\n",
        "     \"options\": []\n",
        "    },\n",
        "    {\n",
        "     \"schemaNodeId\": \"clpvq9d0102z607zych8b2z5d\",\n",
        "     \"featureSchemaId\": \"clpvq9d0102z507zyfwfgacrn\",\n",
        "     \"value\": \"response_c\",\n",
        "     \"label\": \"response_c\",\n",
        "     \"position\": 1,\n",
        "     \"options\": []\n",
        "    },\n",
        "    {\n",
        "     \"schemaNodeId\": \"clpvq9d0102z807zy03y7gysp\",\n",
        "     \"featureSchemaId\": \"clpvq9d0102z707zyh61y5o3u\",\n",
        "     \"value\": \"response_d\",\n",
        "     \"label\": \"response_d\",\n",
        "     \"position\": 2,\n",
        "     \"options\": []\n",
        "    }\n",
        "   ]\n",
        "  }\n",
        " ],\n",
        " \"realTime\": false\n",
        "}\n",
        "\n",
        "\"\"\""
      ],
      "cell_type": "code",
      "outputs": [],
      "execution_count": null
    },
    {
      "metadata": {},
      "source": [
        "## Step 3: Create the annotations payload"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "label_ndjson = []\n",
        "for annotations in [\n",
        "    prompt_annotation_ndjson,\n",
        "    response_radio_annotation_ndjson,\n",
        "    response_text_annotation_ndjson,\n",
        "    response_checklist_annotation_ndjson\n",
        "    ]:\n",
        "  annotations.update({\n",
        "      \"dataRow\": {\n",
        "          \"globalKey\": global_key\n",
        "      }\n",
        "  })\n",
        "  label_ndjson.append(annotations)"
      ],
      "cell_type": "code",
      "outputs": [],
      "execution_count": null
    },
    {
      "metadata": {},
      "source": [
        "## Step 4: Upload annotations to a project as pre-labels or complete labels"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "project = client.get_project(project_id=project_id)"
      ],
      "cell_type": "code",
      "outputs": [],
      "execution_count": null
    },
    {
      "metadata": {},
      "source": [
        "#### Model Assisted Labeling (MAL)"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "upload_job = lb.MALPredictionImport.create_from_objects(\n",
        "    client = client,\n",
        "    project_id = project.uid,\n",
        "    name=f\"mal_job-{str(uuid.uuid4())}\",\n",
        "    predictions=label_ndjson)\n",
        "\n",
        "upload_job.wait_until_done()\n",
        "print(\"Errors:\", upload_job.errors)\n",
        "print(\"Status of uploads: \", upload_job.statuses)"
      ],
      "cell_type": "code",
      "outputs": [],
      "execution_count": null
    },
    {
      "metadata": {},
      "source": [
        "#### Label Import"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "upload_job = lb.LabelImport.create_from_objects(\n",
        "    client = client,\n",
        "    project_id = project.uid,\n",
        "    name=\"label_import_job\"+str(uuid.uuid4()),\n",
        "    labels=label_ndjson)\n",
        "\n",
        "upload_job.wait_until_done();\n",
        "print(\"Errors:\", upload_job.errors)\n",
        "print(\"Status of uploads: \", upload_job.statuses)"
      ],
      "cell_type": "code",
      "outputs": [],
      "execution_count": null
    }
  ]
}